{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tTuYGOlnh117"
   },
   "source": [
    "#  Going Deeper On Molecular Featurizations\n",
    "\n",
    "One of the most important steps of doing machine learning on molecular data is transforming the data into a form amenable to the application of learning algorithms. This process is broadly called \"featurization\" and involves turning a molecule into a vector or tensor of some sort. There are a number of different ways of doing that, and the choice of featurization is often dependent on the problem at hand. We have already seen two such methods: molecular fingerprints, and `ConvMol` objects for use with graph convolutions. In this tutorial we will look at some of the others.\n",
    "\n",
    "## Colab\n",
    "\n",
    "This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Going_Deeper_on_Molecular_Featurizations.ipynb)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 188
    },
    "colab_type": "code",
    "id": "D43MbibL_EK0",
    "outputId": "e7b205ae-9962-4089-d49a-6d0ebe4c8430"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'2.6.0.dev'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "!pip install -qq --pre deepchem\n",
    "import deepchem\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "deepchem.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "omxBgQVDh12B"
   },
   "source": [
    "## Featurizers\n",
    "\n",
    "In DeepChem, a method of featurizing a molecule (or any other sort of input) is defined by a `Featurizer` object.  There are three different ways of using featurizers.\n",
    "\n",
    "1. When using the MoleculeNet loader functions, you simply pass the name of the featurization method to use.  We have seen examples of this in earlier tutorials, such as `featurizer='ECFP'` or `featurizer='GraphConv'`.\n",
    "\n",
    "2. You also can create a Featurizer and directly apply it to molecules.  For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Sp5Hbb4nh12C"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0. 0. 0. ... 0. 0. 0.]\n",
      " [0. 0. 0. ... 0. 0. 0.]\n",
      " [0. 0. 0. ... 0. 0. 0.]]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[09:05:59] DEPRECATION WARNING: please use MorganGenerator\n",
      "[09:05:59] DEPRECATION WARNING: please use MorganGenerator\n",
      "[09:05:59] DEPRECATION WARNING: please use MorganGenerator\n"
     ]
    }
   ],
   "source": [
    "import deepchem as dc\n",
    "\n",
    "featurizer = dc.feat.CircularFingerprint()\n",
    "print(featurizer(['CC', 'CCC', 'CCO']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "_bC1mPM4h12F"
   },
   "source": [
    "3. When creating a new dataset with the DataLoader framework, you can specify a Featurizer to use for processing the data.  We will see this in a future tutorial.\n",
    "\n",
    "We use propane (CH<sub>3</sub>CH<sub>2</sub>CH<sub>3</sub>, represented by the SMILES string `'CCC'`) as a running example throughout this tutorial. Many of the featurization methods use conformers of the molecules. A conformer can be generated using the `ConformerGenerator` class in `deepchem.utils.conformers`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "4D9z0slLh12G"
   },
   "source": [
    "### RDKitDescriptors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oCfATWYIh12H"
   },
   "source": [
    "`RDKitDescriptors` featurizes a molecule by using RDKit to compute values for a list of descriptors. These are basic physical and chemical properties: molecular weight, polar surface area, numbers of hydrogen bond donors and acceptors, etc. This is most useful for predicting things that depend on these high level properties rather than on detailed molecular structure.\n",
    "\n",
    "Intrinsic to the featurizer is a set of allowed descriptors, which can be accessed using `RDKitDescriptors.allowedDescriptors`. The featurizer uses the descriptors in `rdkit.Chem.Descriptors.descList`, checks if they are in the list of allowed descriptors, and computes the descriptor value for the molecule.\n",
    "\n",
    "Let's print the values of the first ten descriptors for propane."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "colab_type": "code",
    "id": "3dt_vjtXh12N",
    "outputId": "c6f73232-0765-479c-93b0-ba18cbf6f33a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MaxAbsEStateIndex 2.125\n",
      "MaxEStateIndex 2.125\n",
      "MinAbsEStateIndex 1.25\n",
      "MinEStateIndex 1.25\n",
      "qed 0.3854706587740357\n",
      "SPS 6.0\n",
      "MolWt 44.097\n",
      "HeavyAtomMolWt 36.033\n",
      "ExactMolWt 44.062600255999996\n",
      "NumValenceElectrons 20.0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[09:07:17] DEPRECATION WARNING: please use MorganGenerator\n",
      "[09:07:17] DEPRECATION WARNING: please use MorganGenerator\n",
      "[09:07:17] DEPRECATION WARNING: please use MorganGenerator\n"
     ]
    }
   ],
   "source": [
    "from rdkit.Chem.Descriptors import descList\n",
    "rdkit_featurizer = dc.feat.RDKitDescriptors()\n",
    "features = rdkit_featurizer(['CCC'])[0]\n",
    "descriptors = [i[0] for i in descList]\n",
    "for feature, descriptor in zip(features[:10], descriptors):\n",
    "    print(descriptor, feature)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, there are many more descriptors than this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "KfyDpE81h12Q",
    "outputId": "46673131-c504-48ca-db35-5d689e218069"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The number of descriptors present is:  210\n"
     ]
    }
   ],
   "source": [
    "print('The number of descriptors present is: ', len(features))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "41RwzbTth12U"
   },
   "source": [
    "### WeaveFeaturizer and MolGraphConvFeaturizer\n",
    "\n",
    "We previously looked at graph convolutions, which use `ConvMolFeaturizer` to convert molecules into `ConvMol` objects.  Graph convolutions are a special case of a large class of architectures that represent molecules as graphs.  They work in similar ways but vary in the details.  For example, they may associate data vectors with the atoms, the bonds connecting them, or both.  They may use a variety of techniques to calculate new data vectors from those in the previous layer, and a variety of techniques to compute molecule level properties at the end.\n",
    "\n",
    "DeepChem supports lots of different graph based models.  Some of them require molecules to be featurized in slightly different ways.  Because of this, there are two other featurizers called `WeaveFeaturizer` and `MolGraphConvFeaturizer`.  They each convert molecules into a different type of Python object that is used by particular models.  When using any graph based model, just check the documentation to see what featurizer you need to use with it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SF3l5yJ4h12f"
   },
   "source": [
    "### CoulombMatrix\n",
    "\n",
    "All the models we have looked at so far consider only the intrinsic properties of a molecule: the list of atoms that compose it and the bonds connecting them.  When working with flexible molecules, you may also want to consider the different conformations the molecule can take on.  For example, when a drug molecule binds to a protein, the strength of the binding depends on specific interactions between pairs of atoms.  To predict binding strength, you probably want to consider a variety of possible conformations and use a model that takes them into account when making predictions.\n",
    "\n",
    "The Coulomb matrix is one popular featurization for molecular conformations.  Recall that the electrostatic Coulomb interaction between two charges is proportional to $q_1 q_2/r$ where $q_1$ and $q_2$ are the charges and $r$ is the distance between them.  For a molecule with $N$ atoms, the Coulomb matrix is a $N \\times N$ matrix where each element gives the strength of the electrostatic interaction between two atoms.  It contains information both about the charges on the atoms and the distances between them.  More information on the functional forms used can be found [here](https://journals.aps.org/prl/pdf/10.1103/PhysRevLett.108.058301).\n",
    "\n",
    "To apply this featurizer, we first need a set of conformations for the molecule.  We can use the `ConformerGenerator` class to do this.  It takes a RDKit molecule, generates a set of energy minimized conformers, and prunes the set to only include ones that are significantly different from each other.  Let's try running it for propane."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "evLPEI6mh12g",
    "outputId": "c0895d51-a38d-494e-d161-31ce5c421fb3"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of available conformers for propane:  1\n"
     ]
    }
   ],
   "source": [
    "from rdkit import Chem\n",
    "\n",
    "generator = dc.utils.ConformerGenerator(max_conformers=5)\n",
    "propane_mol = generator.generate_conformers(Chem.MolFromSmiles('CCC'))\n",
    "print(\"Number of available conformers for propane: \", len(propane_mol.GetConformers()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It only found a single conformer.  This shouldn't be surprising, since propane is a very small molecule with hardly any flexibility.  Let's try adding another carbon."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of available conformers for butane:  3\n"
     ]
    }
   ],
   "source": [
    "butane_mol = generator.generate_conformers(Chem.MolFromSmiles('CCCC'))\n",
    "print(\"Number of available conformers for butane: \", len(butane_mol.GetConformers()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create a Coulomb matrix for our molecule."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 51
    },
    "colab_type": "code",
    "id": "pPIqy39Ih12i",
    "outputId": "ca7b18b3-cfa4-44e8-a907-cbffd4e65364"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[[36.8581052  12.48684429  7.5619687   2.85945193  2.85804514\n",
      "    2.85804556  1.4674015   1.46740144  0.91279491  1.14239698\n",
      "    1.14239675  0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [12.48684429 36.8581052  12.48684388  1.46551218  1.45850736\n",
      "    1.45850732  2.85689525  2.85689538  1.4655122   1.4585072\n",
      "    1.4585072   0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 7.5619687  12.48684388 36.8581052   0.9127949   1.14239695\n",
      "    1.14239692  1.46740146  1.46740145  2.85945178  2.85804504\n",
      "    2.85804493  0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 2.85945193  1.46551218  0.9127949   0.5         0.29325367\n",
      "    0.29325369  0.21256978  0.21256978  0.12268391  0.13960187\n",
      "    0.13960185  0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 2.85804514  1.45850736  1.14239695  0.29325367  0.5\n",
      "    0.29200271  0.17113413  0.21092513  0.13960186  0.1680002\n",
      "    0.20540029  0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 2.85804556  1.45850732  1.14239692  0.29325369  0.29200271\n",
      "    0.5         0.21092513  0.17113413  0.13960187  0.20540032\n",
      "    0.16800016  0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 1.4674015   2.85689525  1.46740146  0.21256978  0.17113413\n",
      "    0.21092513  0.5         0.29351308  0.21256981  0.2109251\n",
      "    0.17113412  0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 1.46740144  2.85689538  1.46740145  0.21256978  0.21092513\n",
      "    0.17113413  0.29351308  0.5         0.21256977  0.17113412\n",
      "    0.21092513  0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.91279491  1.4655122   2.85945178  0.12268391  0.13960186\n",
      "    0.13960187  0.21256981  0.21256977  0.5         0.29325366\n",
      "    0.29325365  0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 1.14239698  1.4585072   2.85804504  0.13960187  0.1680002\n",
      "    0.20540032  0.2109251   0.17113412  0.29325366  0.5\n",
      "    0.29200266  0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 1.14239675  1.4585072   2.85804493  0.13960185  0.20540029\n",
      "    0.16800016  0.17113412  0.21092513  0.29325365  0.29200266\n",
      "    0.5         0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]\n",
      "  [ 0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.\n",
      "    0.          0.          0.          0.          0.        ]]]\n"
     ]
    }
   ],
   "source": [
    "coulomb_mat = dc.feat.CoulombMatrix(max_atoms=20)\n",
    "features = coulomb_mat(propane_mol)\n",
    "print(features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Uyq3Xk3sh12l"
   },
   "source": [
    "Notice that many elements are 0.  To combine multiple molecules in a batch we need all the Coulomb matrices to be the same size, even if the molecules have different numbers of atoms.  We specified `max_atoms=20`, so the returned matrix  has size (20, 20).  The molecule only has 11 atoms, so only an 11 by 11 submatrix is nonzero."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "P-sGs7W2h12p"
   },
   "source": [
    "### CoulombMatrixEig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9NTjtDUzh12p"
   },
   "source": [
    "An important feature of Coulomb matrices is that they are invariant to molecular rotation and translation, since the interatomic distances and atomic numbers do not change.  Respecting symmetries like this makes learning easier.  Rotating a molecule does not change its physical properties.  If the featurization does change, then the model is forced to learn that rotations are not important, but if the featurization is invariant then the model gets this property automatically.\n",
    "\n",
    "Coulomb matrices are not invariant under another important symmetry: permutations of the atoms' indices.  A molecule's physical properties do not depend on which atom we call \"atom 1\", but the Coulomb matrix does.  To deal with this, the `CoulumbMatrixEig` featurizer was introduced, which uses the eigenvalue spectrum of the Coulumb matrix and is invariant to random permutations of the atom's indices.  The disadvantage of this featurization is that it contains much less information ($N$ eigenvalues instead of an $N \\times N$ matrix), so models will be more limited in what they can learn.\n",
    "\n",
    "`CoulombMatrixEig` inherits from `CoulombMatrix` and featurizes a molecule by first computing the Coulomb matrices for different conformers of the molecule and then computing the eigenvalues for each Coulomb matrix. These eigenvalues are then padded to account for variation in number of atoms across molecules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 51
    },
    "colab_type": "code",
    "id": "ga1-nNiWh12t",
    "outputId": "2df3163c-6808-49e6-dba8-282ddd7fa3c4"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[60.07620303 29.62963149 22.75497781  0.5713786   0.28781332  0.28548338\n",
      "   0.27558187  0.18163794  0.17460999  0.17059719  0.16640098  0.\n",
      "   0.          0.          0.          0.          0.          0.\n",
      "   0.          0.        ]]\n"
     ]
    }
   ],
   "source": [
    "coulomb_mat_eig = dc.feat.CoulombMatrixEig(max_atoms=20)\n",
    "features = coulomb_mat_eig(propane_mol)\n",
    "print(features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SMILES Tokenization and Numericalization\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we have looked at featurization techniques that translate the implicit structural and physical information in SMILES data into more explicit features that help our models learn and make predictions. In this section, we will preprocess SMILES strings into a format that can be fed to sequence models, such as 1D convolutional neural networks and transformers, and enables these models to learn their own representations of molecular properties.\n",
    "\n",
    "To prepare SMILES strings for a sequence model, we break them down into lists of substrings (called tokens) and turn them into lists of integer values (numericalization). Sequence models use those integer values as indices of an embedding matrix, which contains a vector of floating-point numbers for each token in the vocabulary. These embedding vectors are updated during model training. This process allows the sequence model to learn its own representations of the molecular properties implicit in the training data.\n",
    "\n",
    "We will use DeepChem's `BasicSmilesTokenizer` and the Tox21 dataset from MoleculeNet to demonstrate the process of tokenizing SMILES."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<DiskDataset X.shape: (6264,), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>\n"
     ]
    }
   ],
   "source": [
    "tasks, datasets, transformers = dc.molnet.load_tox21(featurizer=\"Raw\")\n",
    "train_dataset, valid_dataset, test_dataset = datasets\n",
    "print(train_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We loaded the datasets with `featurizer=\"Raw\"`. Now we obtain the SMILES from their `ids` attributes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['CC(O)(P(=O)(O)O)P(=O)(O)O' 'CC(C)(C)OOC(C)(C)CCC(C)(C)OOC(C)(C)C'\n",
      " 'OC[C@H](O)[C@@H](O)[C@H](O)CO'\n",
      " 'CCCCCCCC(=O)[O-].CCCCCCCC(=O)[O-].[Zn+2]' 'CC(C)COC(=O)C(C)C']\n"
     ]
    }
   ],
   "source": [
    "train_smiles = train_dataset.ids\n",
    "valid_smiles = valid_dataset.ids\n",
    "test_smiles = test_dataset.ids\n",
    "print(train_smiles[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we define our tokenizer and `map` it onto all our data to convert the SMILES strings into lists of tokens. The `BasicSmilesTokenizer` breaks down SMILES roughly at atom level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['C', 'C', '(', 'O', ')', '(', 'P', '(', '=', 'O', ')', '(', 'O', ')', 'O', ')', 'P', '(', '=', 'O', ')', '(', 'O', ')', 'O']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "6264"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer = dc.feat.smiles_tokenizer.BasicSmilesTokenizer()\n",
    "\n",
    "train_tok = list(map(tokenizer.tokenize, train_smiles))\n",
    "valid_tok = list(map(tokenizer.tokenize, valid_smiles))\n",
    "test_tok = list(map(tokenizer.tokenize, test_smiles))\n",
    "print(train_tok[0])\n",
    "len(train_tok)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have tokenized versions of all SMILES strings in our dataset. To convert those into lists of integer values we first need to create a list of all possible tokens in our dataset. That list is called the vocabulary. We also add the empty string `\"\"` to our vocabulary in order to correctly handle trailing zeros when decoding zero-padded numericalized SMILES."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['', '#', '(', ')', '-', '.', '/', '1', '2', '3', '4', '5'] ... ['[n+]', '[n-]', '[nH+]', '[nH]', '[o+]', '[s+]', '[se]', '\\\\', 'c', 'n', 'o', 's']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "128"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "flatten = lambda l: [item for items in l for item in items]\n",
    "\n",
    "all_toks = flatten(train_tok) + flatten(valid_tok) + flatten(test_tok)\n",
    "vocab = sorted(set(all_toks + [\"\"]))\n",
    "print(vocab[:12], \"...\", vocab[-12:])\n",
    "len(vocab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To numericalize tokenized SMILES strings we create a `str2int` dictionary which assigns a number to each token in the dictionary. We also create the reverse `int2str` dictionary and define the corresponding `encode` and `decode` functions. Finally we `map` the `encode` function on the tokenized data to obtain numericalized SMILES data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "str2int: {'': 0, '#': 1, '(': 2, ')': 3, '-': 4} ...\n",
      "int2str: {0: '', 1: '#', 2: '(', 3: ')', 4: '-'} ...\n"
     ]
    }
   ],
   "source": [
    "str2int = {s:i for i, s in enumerate(vocab)}\n",
    "int2str = {i:s for i, s in enumerate(vocab)}\n",
    "print(f\"str2int: {dict(list(str2int.items())[:5])} ...\")\n",
    "print(f\"int2str: {dict(list(int2str.items())[:5])} ...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CC(O)(P(=O)(O)O)P(=O)(O)O\n",
      "[19, 19, 2, 24, 3, 2, 25, 2, 16, 24, 3, 2, 24, 3, 24, 3, 25, 2, 16, 24, 3, 2, 24, 3, 24]\n",
      "CC(O)(P(=O)(O)O)P(=O)(O)O\n"
     ]
    }
   ],
   "source": [
    "encode = lambda s: [str2int[tok] for tok in s]\n",
    "decode = lambda i: [int2str[num] for num in i]\n",
    "print(train_smiles[0])\n",
    "print(encode(train_tok[0]))\n",
    "print(\"\".join(decode(encode(train_tok[0]))))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[19, 19, 2, 24, 3, 2, 25, 2, 16, 24, 3, 2, 24, 3, 24, 3, 25, 2, 16, 24, 3, 2, 24, 3, 24]\n"
     ]
    }
   ],
   "source": [
    "train_num = list(map(encode, train_tok))\n",
    "valid_num = list(map(encode, valid_tok))\n",
    "test_num = list(map(encode, test_tok))\n",
    "print(train_num[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, we would like to combine all molecules in a dataset in an `np.array` so they can be served to a model in batches. To achieve that, all sequences have to be of the same length. As in the CoulombMatrix section, we achieve that by appending zeros up to a fixed value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "240"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "max_len = max(map(len, train_num + valid_num + test_num))\n",
    "max_len"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The longest sequence across all Tox21 datasets has length `240`, so we use that as our fixed length. We create a `zero_pad` function,  `map` it to all numericalized SMILES, and turn them into `np.array`s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[19, 19,  2, ...,  0,  0,  0],\n",
       "       [19, 19,  2, ...,  0,  0,  0],\n",
       "       [24, 19, 42, ...,  0,  0,  0],\n",
       "       ...,\n",
       "       [24, 16, 19, ...,  0,  0,  0],\n",
       "       [19, 19,  2, ...,  0,  0,  0],\n",
       "       [19, 19,  2, ...,  0,  0,  0]])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zero_pad = lambda x: x + [0] * (max_len - len(x))\n",
    "\n",
    "train_numpad = np.array(list(map(zero_pad, train_num)))\n",
    "valid_numpad = np.array(list(map(zero_pad, valid_num)))\n",
    "test_numpad = np.array(list(map(zero_pad, test_num)))\n",
    "train_numpad"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check that the zero-padded data still converts back to the correct SMILES string using the `decode` function on a random datapoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cc1cc(C(C)(C)c2ccc(O)c(C)c2)ccc1O\n",
      "Cc1cc(C(C)(C)c2ccc(O)c(C)c2)ccc1O\n"
     ]
    }
   ],
   "source": [
    "idx = np.random.randint(0, train_numpad.shape[0], 1).item()\n",
    "print(train_smiles[idx])\n",
    "print(\"\".join(decode(train_numpad[idx])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The padded data passes the test. It is now in the correct format to be used for training of a sequence model, but it doesn't yet interface nicely with DeepChem's training framework. To change that, we define a `tokenize_smiles` function that combines all the steps spelled out above to process a single datapoint. Additionally, we define a `SmilesFeaturizer` that uses our custom `tokenize_smiles` function in its `_featurize` method and instanciate it as `smiles_featurizer` passing it our `vocab` and `max_len`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tokenize_smiles(x, vocab, max_len):\n",
    "    tokenizer = dc.feat.smiles_tokenizer.BasicSmilesTokenizer()\n",
    "    str2int = {s:i for i, s in enumerate(vocab)}\n",
    "    encode = lambda s: [str2int[tok] for tok in s]\n",
    "    zero_pad = lambda x: x + [0] * (max_len - len(x))\n",
    "    x = tokenizer.tokenize(x)\n",
    "    x = encode(x)\n",
    "    x = zero_pad(x)\n",
    "    return np.array(x)\n",
    "\n",
    "class SmilesFeaturizer(dc.feat.Featurizer):\n",
    "    def __init__(self, feat_func, vocab, max_len):\n",
    "        self.feat_func = feat_func\n",
    "        self.vocab = vocab\n",
    "        self.max_len = max_len\n",
    "        \n",
    "    def _featurize(self, x):\n",
    "        return self.feat_func(x, self.vocab, self.max_len)\n",
    "    \n",
    "smiles_featurizer = SmilesFeaturizer(tokenize_smiles, vocab, max_len)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we use the `smiles_featurizer` to create new Tox21 datasets that contain tokenized and numericalized SMILES in their `X` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[09:24:48] WARNING: not removing hydrogen atom without neighbors\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[19 19  2 ...  0  0  0]\n",
      " [19 19  2 ...  0  0  0]\n",
      " [24 19 42 ...  0  0  0]\n",
      " ...\n",
      " [24 16 19 ...  0  0  0]\n",
      " [19 19  2 ...  0  0  0]\n",
      " [19 19  2 ...  0  0  0]]\n"
     ]
    }
   ],
   "source": [
    "tasks, datasets, transformers = dc.molnet.load_tox21(featurizer=smiles_featurizer)\n",
    "print(datasets[0].X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The datasets are now ready to be used with your custom DeepChem sequence model. Don't forget to wrap your model into the appropriate DeepChem model class."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "wssi6cBmh12z"
   },
   "source": [
    "# Congratulations! Time to join the Community!\n",
    "\n",
    "Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:\n",
    "\n",
    "## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)\n",
    "This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.\n",
    "\n",
    "## Join the DeepChem Gitter\n",
    "The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pOBd6-YdQSvF"
   },
   "source": [
    "## Citing This Tutorial\n",
    "If you found this tutorial useful please consider citing it using the provided BibTeX. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "KZUk_9yIYw0c"
   },
   "outputs": [],
   "source": [
    "@manual{Intro7, \n",
    " title={Going Deeper on Molecular Featurizations}, \n",
    " organization={DeepChem},\n",
    " author={Ramsundar, Bharath}, \n",
    " howpublished = {\\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Going_Deeper_on_Molecular_Featurizations.ipynb}}, \n",
    " year={2021}, \n",
    "} "
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "06_Going_Deeper_on_Molecular_Featurizations.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
